Scaling Laws for Neural Language Models

Published 2020/01, shortly before GPT-3 (2020/06)

Figure 1

The more compute is invested, the lower the test loss: the curve slopes steadily down to the right

So scale up compute, data, and parameters

Data is measured in number of tokens

With compute held fixed, it is better to prioritize increasing parameters over increasing data
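A scaling law of this kind is a power law, so the downward trend in Figure 1 becomes a straight line on log-log axes. The following is a minimal sketch of how such a trend could be fitted, assuming the form loss ≈ a * x^(-alpha). The function name `fit_power_law` and the data points are hypothetical and synthetic; they are not values from the paper.

```python
import math

def fit_power_law(xs, losses):
    """Fit loss ~= a * x**(-alpha) by ordinary least squares in log-log space.

    Returns (a, alpha). Hypothetical helper for illustration only.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line log(loss) = intercept + slope * log(x)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points generated from a known power law (made-up numbers,
# NOT measurements from the paper), used to check that the fit recovers them.
true_a, true_alpha = 2.5, 0.05
compute = [1e-6, 1e-4, 1e-2, 1.0]
loss = [true_a * c ** (-true_alpha) for c in compute]

a, alpha = fit_power_law(compute, loss)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

The same helper works whether x is compute, token count, or parameter count, since only the axis changes.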

Figure 2

The more parameters a model has, the more readily its loss goes down

Figure 6

The scaling laws are checked for LSTMs as well, not only Transformers

Number of layers (a hyperparameter)

Figure 5

Aspect ratio (Attention vs FFN; the model's shape ratio)

The conclusion appears to be that these make no real difference

1.3 Notation

L – the cross entropy loss in nats

N – the number of model parameters, excluding all vocabulary and positional embeddings

C ≈ 6NBS – an estimate of the total non-embedding training compute

where B is the batch size, and S is the number of training steps (ie parameter updates).
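As a worked example of the estimate C ≈ 6NBS, here is a small sketch. It assumes that B is counted in tokens, so that B * S is the total number of tokens processed. The model size, batch size, and step count are hypothetical. The conversion to PF-days (10^15 FLOP/s sustained for one day) is added here for readability and does not appear in the notation quoted above.

```python
# One PF-day: 1e15 floating-point operations per second for 24 hours.
PF_DAY_FLOPS = 1e15 * 24 * 3600  # = 8.64e19 FLOPs

def training_compute_flops(n_params, batch_size_tokens, steps):
    """Estimate non-embedding training compute as C ~= 6 * N * B * S.

    n_params: N, non-embedding parameter count.
    batch_size_tokens: B, assumed here to be measured in tokens.
    steps: S, number of parameter updates.
    """
    return 6 * n_params * batch_size_tokens * steps

def flops_to_pf_days(flops):
    """Convert a FLOP count to PF-days."""
    return flops / PF_DAY_FLOPS

# Hypothetical configuration, not taken from the paper.
n = 1_000_000_000   # N = 1e9 non-embedding parameters
b = 500_000         # B = 5e5 tokens per batch
s = 100_000         # S = 1e5 training steps

c = training_compute_flops(n, b, s)
print(f"C = {c:.2e} FLOPs = {flops_to_pf_days(c):.2f} PF-days")
```

With these made-up numbers, C = 6 * 1e9 * 5e5 * 1e5 = 3e20 FLOPs, which is roughly 3.5 PF-days.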

The GPT-3 paper, Language Models are Few-Shot Learners, showed that the scaling laws hold